< Other articles

Solutions of xchg rax,rax

AuthorAlexandro Sanchez Date2016-10-12

Introduction

In words of xorpd, the author of xchg rax,rax:

xchg rax,rax is a collection of assembly gems and riddles I found over many years of reversing and writing assembly code. The book contains 0x40 short assembly snippets, each built to teach you one concept about assembly, math or life in general.

Be warned - This book is not for beginners. It doesn't contain anything besides assembly code, and therefore some x86_64 assembly knowledge is required.

How to use this book? Get an assembler (Yasm or Nasm is recommended), and obtain the x86_64 instruction set. Then for every snippet, try to understand what it does. Try to run it with different inputs if you don't understand it in the beginning. Look up for instructions you don't fully know in the Instruction sets PDF. Start from the beginning. The order has meaning.

As a final note, the full contents of the book could be viewed for free on my website (Just google "xchg rax,rax").

The original release, which can be read online at [1], contains no official solutions, and some of the snippets doesn't even seem to yield a clearly defined "answer". Also, in his own words:

Is that the content of your book? Some assembly language instructions without comments?

Yes.

Is that a bad joke?

No, arranging almost meaningless sequences of assembler instructions against a black background is a form of art. You may call it nerd poetry.

Nevertheless, I recovered from old backups my own thoughts and solutions for some of the snippets, and uploaded them just in case it could be useful or interesting for someone.

[1] http://www.xorpd.net/pages/xchg_rax/snip_00.html

Solutions

Snippet 0x00

    xor      eax,eax
    lea      rbx,[0]
    loop     $
    mov      rdx,0
    and      esi,0
    sub      edi,edi
    push     0
    pop      rbp

Different ways of setting several general purpose registers to 0.

Snippet 0x01

.loop:
    xadd     rax,rdx
    loop     .loop

Computes the rcx-th term of the Fibonacci sequence, assuming the initial state rax=0, rdx=1.

Snippet 0x02

    neg      rax
    sbb      rax,rax
    neg      rax

Boolean cast: rax := bool(rax), i.e. rax := rax ? 1 : 0.

Snippet 0x03

    sub      rdx,rax
    sbb      rcx,rcx
    and      rcx,rdx
    add      rax,rcx

Minimum function: rax := min(rax, rdx).

Snippet 0x04

    xor      al,0x20

Replaces uppercase with lowercase characters and vice-versa.

Snippet 0x05

    sub      rax,5
    cmp      rax,4

Allows to branch depending on whether rax is in range [5,9] using only one jbe jump.

Snippet 0x06

    not      rax
    inc      rax
    neg      rax

Does nothing since the instructions cancel each other.

Snippet 0x07

    inc      rax
    neg      rax
    inc      rax
    neg      rax

Does nothing since the instructions cancel each other.

Snippet 0x08

    add      rax,rdx
    rcr      rax,1

Computes the average, i.e. rax := (rax + rdx) / 2.

Note that it prevents overflow issues since rcr does a 33-bit rotation using the CF flag.

Snippet 0x09

    shr      rax,3
    adc      rax,0

Computes rax := (rax + 4) / 8.

This calculates rax / 8 rounded to the nearest integer (thanks @ZaneH).

Snippet 0x0A

    add      byte [rdi],1
.loop:
    inc      rdi
    adc      byte [rdi],0
    loop     .loop

Increments by one an arbitrarily long little-endian integer at rdi.

Snippet 0x0B

    not      rdx
    neg      rax
    sbb      rdx,-1

Computes the negation of the 128-bit integer stored in the RDX:RAX registers (thanks Aviya Erenfeld!).

Snippet 0x0C

    mov      rcx,rax
    xor      rcx,rbx
    ror      rcx,0xd

    ror      rax,0xd
    ror      rbx,0xd
    xor      rax,rbx

    cmp      rax,rcx

Registers rax, rcx end up with the same value, thanks to distributivity of ROR (with XOR).

rcx = rax
rcx = (rcx ^ rbx) >> 13
rax = (rax >> 13) ^ (rbx >> 13)

Snippet 0x0D

    mov      rdx,rbx

    xor      rbx,rcx
    and      rbx,rax

    and      rdx,rax
    and      rax,rcx
    xor      rax,rdx

    cmp      rax,rbx

Registers rdx, rbx end up with the same value, thanks to distributivity of AND (with XOR) and commutativity of XOR.

rdx = rbx
rbx = (rbx & rax) ^ (rcx & rax)  // Associativity of AND
rax = (rdx & rax) ^ (rcx & rax)  // Commutativity of XOR

Snippet 0x0E

    mov      rcx,rax
    and      rcx,rbx
    not      rcx

    not      rax
    not      rbx
    or       rax,rbx

    cmp      rax,rcx

Registers rax, rcx end up with the same value, thanks to DeMorgan's law.

rcx = rax
rcx = ~(rcx & rbx)
rax = ~rax | ~rbx

Snippet 0x0F

.loop:
    xor      byte [rsi],al
    lodsb
    loop     .loop

Computes the following:

rsi[0] ^= al
rsi[1] ^= rsi[0]
rsi[2] ^= rsi[1]
rsi[3] ^= rsi[2]
...

This resembles a 8-bit CBC encryption scheme using al as IV, except that it uses the identity function for block cipher encryption, not a pseudorandom function with a key. Instead, the IV plays the role as key.

This was the staple home-made "crypto" in the early 80s home computer era.

Snippet 0x10

    push     rax
    push     rcx
    pop      rax
    pop      rcx

    xor      rax,rcx
    xor      rcx,rax
    xor      rax,rcx

    add      rax,rcx
    sub      rcx,rax
    add      rax,rcx
    neg      rcx

    xchg     rax,rcx

Different ways of swapping the contents of rax and rcx.

Snippet 0x11

.loop:
    mov      dl,byte [rsi]
    xor      dl,byte [rdi]
    inc      rsi
    inc      rdi
    or       al,dl
    loop     .loop

Compares two buffers rsi and rdi of length rcx. Assuming al is zero-initialized, it will remain as 0 unless the buffers differ.

Snippet 0x12

    mov      rcx,rdx
    and      rdx,rax
    or       rax,rcx
    add      rax,rdx

Computes (rax | rdx) + (rax & rdx), which can be simplified to rax + rdx.

See also Rich Schroeppel's Item 23 in HAKMEM.

Snippet 0x13

    mov      rcx,0x40
.loop:
    mov      rdx,rax
    xor      rax,rbx
    and      rbx,rdx
    shl      rbx,0x1
    loop     .loop

Computes rax + rbx.

See also: https://en.wikipedia.org/wiki/Adder_(electronics)

Snippet 0x14

    mov      rcx,rax
    and      rcx,rdx

    xor      rax,rdx
    shr      rax,1

    add      rax,rcx

Computes the average rounded to the lowest integer, i.e. rax := floor((rax + rdx) / 2), avoiding overflows (see also Snippet 0x08).

Snippet 0x15

    mov      rdx,0xffffffff80000000
    add      rax,rdx
    xor      rax,rdx

Casts the int32_t value in eax to an int64_t value in rax.

Note that this requires the 32 most significant bits in rax to be cleared.

Snippet 0x16

    xor      rax,rbx
    xor      rbx,rcx
    mov      rsi,rax
    add      rsi,rbx
    cmovc    rax,rbx
    xor      rax,rbx
    cmp      rax,rsi

Computes the following:

rsi := (rax ^ rbx) + (rbx ^ rcx)
if (overflew(rsi))
    rax := 0
else
    rax := rax ^ rcx
cmp(rax, rsi)

Observations: - If rax == rbx: No overflow. - If rcx == rbx: No overflow. - If rax == ~rbx: Always overflows, except when rbx == rcx. - If rbx == rax|rcx: No overflow. - If rbx == rax&rcx: No overflow.

TODO: No idea about this one.

Snippet 0x17

    cqo
    xor      rax,rdx
    sub      rax,rdx

Computes the absolute value of rax, i.e. rax := abs(rax).

Snippet 0x18

    rdtsc
    shl      rdx,0x20
    or       rax,rdx
    mov      rcx,rax

    rdtsc
    shl      rdx,0x20
    or       rax,rdx

    cmp      rcx,rax

Compares two consecutively obtained timestamps.

The instruction rdtsc stores the current 64-bit timestamp counter value in edx:eax, while the shl and or instructions aggregate the halves of each register into rax. Trivially, the second timestamp will always be larger than the first one.

Snippet 0x19

    call     .skip
    db       'hello world!',0
.skip:
    call     print_str
    add      rsp,8

Calls print_str("hello world!");.

The return value of .skip is the address to the hardcoded string, and due to the stack layout, this will implicitly become the first argument of print_str.

This was a very common pattern in assembly language programs: The instruction sequence would be generated by a macro allowing coders to embed commands like PRINT_STR 'Hello, world!' in their code without having to define the string in the data segment, assign a symbol to it and then reference from code.

Snippet 0x1A

    call     .next
.next:
    pop      rax

Gets the rip register after the call instruction, i.e. rax := .next. This is needed to obtain the current instruction pointer in Position-Independent Code.

Snippet 0x1B

    push     rax
    ret

Indirect branch to rax. Since there's no immediate arguments to cause further stack cleanup, this is equivalent to jmp rax.

Snippet 0x1C

    pop      rsp

Swaps the stack pointer with the address at the top of the current stack. One of the many stack pivot gadgets used during Return-Oriented Programming.

Snippet 0x1D

    mov      rsp,buff2 + n*8 + 8
    mov      rbp,buff1 + n*8
    enter    0,n+1

Copies buff1 to buff2. The extra +1 and +8 take care of the side effects of enter, such as the extra values added before and after the buffer.

Note that the buff2 should be 16 bytes larger than buff1, specifically with 8 bytes of padding at the beginning and at the end to prevent OOB writes after executing the enter instruction.

Snippet 0x1E

    cmp      al,0x0a
    sbb      al,0x69
    das

Maps each value in range [0x0, 0xF] stored in al into its hexadecimal representation, i.e. [0x30, ..., 0x39, 0x41, ..., 0x46].

Snippet 0x1F

.loop:
    bsf      rcx,rax
    shr      rax,cl
    cmp      rax,1
    je       .exit_loop
    lea      rax,[rax + 2*rax + 1]
    jmp      .loop
.exit_loop:

Computes the Collatz sequence for any starting number stored in rax.

See also: https://en.wikipedia.org/wiki/Collatz_conjecture

Snippet 0x20

    mov      rcx,rax
    shl      rcx,2
    add      rcx,rax
    shl      rcx,3
    add      rcx,rax
    shl      rcx,1
    add      rcx,rax
    shl      rcx,1
    add      rcx,rax
    shl      rcx,3
    add      rcx,rax

Computes rcx := 1337 * rax.

Snippet 0x21

    mov      rsi,rax
    add      rax,rbx
    mov      rdi,rdx
    sub      rdx,rcx
    add      rdi,rcx

    imul     rax,rcx
    imul     rsi,rdx
    imul     rdi,rbx

    add      rsi,rax
    mov      rbx,rsi
    sub      rax,rdi

Computes the following after corresponding simplifications:

rax := rax * rcx - rbx * rdx
rbx := rax * rdx + rbx * rcx

This corresponds to the multiplication of two complex numbers: (a + bi) * (c + di) = (ac - bd) + (ad + bc)i.

Note that it is using only three multiplications to calculate the four distinct products in the result. This is the essence of the Karatsuba Algorithm for polynomial multiplication (thanks @eleemosynator).

Snippet 0x22

    mov      rdx,0xaaaaaaaaaaaaaaab
    mul      rdx
    shr      rdx,1
    mov      rax,rdx

Computes rax := rax / 3 rounding to the closest integer.

Snippet 0x23

.loop:
    cmp      rax,5
    jbe      .exit_loop
    mov      rdx,rax
    shr      rdx,2
    and      rax,3
    add      rax,rdx
    jmp      .loop
.exit_loop:

    cmp      rax,3
    cmc
    sbb      rdx,rdx
    and      rdx,3
    sub      rax,rdx

Computes rax := rax % 3.

Note that 22k = 1 (mod 3) and 22k+1 = 2 (mod 3). Hence, in order to calculate rax % 3 one can sum the digits of the number written in base four (i.e. sum pairs of bits at even locations) and take the result modulo 3.

There is also a cute bit of code at the end that does the final reduction modulo 3 without any branches. This trick is the base 4 equivalent of using the sum of the digits of a number in base 10 to check for divisibility by 3. Both tricks work in the same way because both 10 and 4 leave a remainder of 1 when divided by 3 (thanks @eleemosynator).

Snippet 0x24

    mov      rbx,rax
    mov      rsi,rax
.loop:
    mul      rbx
    mov      rcx,rax

    sub      rax,2
    neg      rax
    mul      rsi
    mov      rsi,rax

    cmp      rcx,1
    ja       .loop
.exit_loop:

This computes the multiplicative inverse of rax modulo 264 using the Newton-Raphson algorithm, which we can write as follows:

uint64_t multiplicative_inverse_mod_2_64(uint64_t x)
{
    uint64_t z = x;
    uint64_t t = 0;
    do {
        t = x * z;
        z = z * (2 - t);
    } while (t > 1);
    return z;
}

This paper has a good analysis: https://arxiv.org/pdf/1209.6626.pdf.

Notice also that the loop exit condition takes advantage of the intermediate multiplication x * z to early out and also avoid having to maintain a separate loop counter. Also note that the loop condition is checked at the end - this trades off the cost of an extra couple of operations at the end of the algorithm against adding an extra branch in the middle of the loop which would be more expensive on modern architectures (thanks @eleemosynator).

Snippet 0x25

    xor      eax,eax
    mov      rcx,1
    shl      rcx,0x20
.loop:
    movzx    rbx,cx
    imul     rbx,rbx

    ror      rcx,0x10
    movzx    rdx,cx
    imul     rdx,rdx
    rol      rcx,0x10

    add      rbx,rdx
    shr      rbx,0x20
    cmp      rbx,1
    adc      rax,0
    loop     .loop

Register rax acts as a counter, and rcx loops from 0x100000000 to 0x0. Each iteration will split the ecx values in two 16-bit halves, x and y, and verify the following property:

(x*x + y*y) >> 20 == 1

Since x and y are each 16-bit, x*x + y*y must be in range [0x0, 0x1FFFC0002], thus (x*x + y*y) >> 20 must be either 0 or 1. Therefore, each iteration is actually verifying that: x*x + y*y >= 0x100000000. This is equivalent to x^2 + y^2 >= 0x10000^2.

Since (x,y) iterate each in range [0x0000, 0xFFFF] as rcx decreases. The value stored in rax after the snippet has been executed will be the number of points in [0, 65535]^2 lying outside the circumference of radius 65536 centered at the origin.

This ratio of points will be 1 - pi/4, yielding an expected value of rax := 0x100000000 * (1 - pi/4).

Snippet 0x26

    mov      rdx,rax
    shr      rax,7
    shl      rdx,0x39
    or       rax,rdx

Rotates the value in the rax register 7 bits to the right. Equivalent to ror rax, 7.

Snippet 0x27

    mov      ch,cl
    inc      ch
    shr      ch,1
    shr      cl,1
    shr      rax,cl
    xchg     ch,cl
    shr      rax,cl

This is rax >> (floor(cl / 2) + floor((cl + 1) / 2)) which will be identical to rax >> cl when cl is below 64. Once cl exceeds 64, this snippet will return 0 whereas rax >> cl will return rax shifted right by cl % 64. They will then come back in line once cl exceeds 128, and then the whole process will repeat itself. Sometimes, two halves do not a whole make.

Snippet 0x28

    clc
.loop:
    rcr      byte [rsi],1
    inc      rsi
    loop     .loop

Right-shifts by one an entire buffer at rsi with a length of rcx bytes. This can also be interpreted as right-shifting by one, i.e. dividing by two, an arbitrarily long big-endian 8*rcx-bit integer at rsi.

Snippet 0x29

    lea      rdi,[rsi + 3]
    rep movsb

Repeats the first 3 bytes at rsi starting from offset rsi+3 until rcx bytes have been written.

This could be used to fill a texture of 24-bit pixels with a constant color stored in the first pixel of the buffer. In this case, the first 3 bytes would be written, and then the snippet would be executed with rcx := 3 * width * height - 3.

Snippet 0x2A

    mov      rsi,rbx
    mov      rdi,rbx
.loop:
    lodsq
    xchg     rax,qword [rbx]
    stosq
    loop     .loop

This moves the last element of the array of rcx keywords pointed by rbx to the front.

For instance, let rcx := 4 and rbx point to [Q0, Q1, Q2, Q3, Q4] with quadwords Qi. Then, after executing this snippet the contents of rbx will be: [Q4, Q0, Q1, Q2, Q3].

Snippet 0x2B

    xor      eax,eax
    xor      edx,edx
.loop1:
    xlatb
    xchg     rax,rdx
    xlatb
    xlatb
    xchg     rax,rdx
    cmp      al,dl
    jnz      .loop1

    xor      eax,eax
.loop2:
    xlatb
    xchg     rax,rdx
    xlatb
    xchg     rax,rdx
    cmp      al,dl
    jnz      .loop2

Cycle-finding using Floyd's Algorithm. This snippet assumes that rbx points to a table of values of a function that has a single byte argument and single byte value. Consider the sequence xn=f(xn-1). For any starting value this sequence will eventually start cycling as f only takes 256 distinct values and has no memory. The first half of the snippet executes Floyd's Tortoise and Hare algorithm to find a collision point inside the cycle and the second part locates the head of the cycle which is the result of the snippet. To see this, assume that the sequence has a starting segment of length l followed by a cycle of length n. If our Tortoise and Hare collide at step s, then their locations within the cycle must be identical and we must have:

s - l = 2 s - l (mod n) => s = 0 (mod n)

Which implies that the Tortoise's location within the cycle is equal to -l modulo n, Hence, if we start another Tortoise from the initial point and have both Tortoises walk at the same rate, they will collide at the head of the cycle.

In C this looks like:

typedef unsigned char byte;

byte snippet_0x2b(byte *tbl, byte x0)   // tbl in EBX, x0 is set zero
{
   byte t, h;                         // Tortoise and Hare
   t = h = x0;
   do {
       t = tbl[t];
       h = tbl[tbl[h]];
    } while (t != h);

    for (byte t2 = x0; t2 != t;) {
       t  = tbl[t];
       t2 = tbl[t2];
    }

    return t;
}

(thanks @eleemosynator)

Snippet 0x2C

    mov      qword [rbx + 8*rcx],0
    mov      qword [rbx + 8*rdx],1
    mov      rax,qword [rbx + 8*rcx]

    mov      qword [rbx],rsi
    mov      qword [rbx + 8],rdi
    mov      rax,qword [rbx + 8*rax]

This will move rsi or rdi into rax depending on whether rcx and rdx are different or not, respectively. This is equivalent to: rax := (rcx == rdx) ? rdi : rsi. This assumes that all rbx-relative offsets point to valid memory. Effectively a comparison and conditional move implemented using just mov instructions.

The M/o/Vfuscator project takes this concept to a whole new extreme and delivers a C compiler the produces binaries which only contain mov instructions (with a tiny amount of cheating for loops).

Snippet 0x2D

    mov      rdx,rax
    dec      rax
    and      rax,rdx

Determines if rax is a power of two by computing rax := rax & (rax - 1). The result will be zero if and only if rax is a power of two.

Snippet 0x2E

    mov      rdx,rax
    dec      rdx
    xor      rax,rdx
    shr      rax,1
    cmp      rax,rdx

Determines if rax is a power of two larger than zero by comparing (rax ^ (rax - 1)) >> 1 and rax - 1. Both values will be equal if and only if rax is a power of two larger than zero. Note that the case rax == 0 will result on different values due to the right-shift operation.

See also: Snippet 0x2D.

Snippet 0x2F

    xor      eax,eax
.loop:
    jrcxz    .exit_loop
    inc      rax
    mov      rdx,rcx
    dec      rdx
    and      rcx,rdx
    jmp      .loop
.exit_loop:

This snippet stores the number of 1-bits in rcx into the rax register. This relies on the trick features in Snippet 0x2D to clear the rightmost 1-bit until rcx is zero.

See also: Snippet 0x2D.

Snippet 0x30

    and      rax,rdx

    sub      rax,rdx
    and      rax,rdx

    dec      rax
    and      rax,rdx

This snippet computes ((((rax & rdx) - rdx) & rdx) - 1) & rdx. This expression is equivalent to rax & rdx.

Snippet 0x31

    mov      rcx,rax
    shr      rcx,1
    xor      rcx,rax

    inc      rax

    mov      rdx,rax
    shr      rdx,1
    xor      rdx,rax

    xor      rdx,rcx

The snippet is nicely laid out in four stanzas to give us a hint of what's going on. The first and third calculate x^(x>>1) which transforms an index to the corresponding Gray Code sequence element (see also Sloane's A003188). Hence the whole snippet will calculate the xor of two consecutive Gray Codes. These will differ in exactly one bit which corresponds to the highest power of 2 that divides x+1 (Sloane's A006519). This is also equivalent to ~x&(x+1).

(thanks @eleemosynator)

Snippet 0x32

    mov      rcx,rax

    mov      rdx,rax
    shr      rdx,1
    xor      rax,rdx

    popcnt   rax,rax
    xor      rax,rcx
    and      rax,1

The Gray Code is at the center of this one as well. The snippet calculates (popcnt(x^(x>>1))^x) & 1 where popcnt counts the number of set bits in a register (population count, sometimes also referred to as weight). We can unpack this by using the distributive property of AND over XOR: (popcnt(x^(x>>1))&1) ^ (x&1). Now the first part is just the parity of the Gray Code of index x (lowest bit of weight tells us if there are an even or odd number of set bits) and the second part is the lowest bit of x and they will always be equal. One way to see this is to think of the parity as the XOR of all the bits in an integer, hence in calculating the parity of x^(x>>1) every bit of x will appear twice except for bit 0 which is shifted off the bottom and only shows up once. Another way to see this is by inspecting the Gray Code inversion formula which we are about to meet in the next snippet.

Snippet 0x33

    mov      rdx,rax
    shr      rdx,0x1
    xor      rax,rdx

    mov      rdx,rax
    shr      rdx,0x2
    xor      rax,rdx

    mov      rdx,rax
    shr      rdx,0x4
    xor      rax,rdx

    mov      rdx,rax
    shr      rdx,0x8
    xor      rax,rdx

    mov      rdx,rax
    shr      rdx,0x10
    xor      rax,rdx

    mov      rdx,rax
    shr      rdx,0x20
    xor      rax,rdx

Maps a Gray Code to its corresponding sequence number. This is the inverse operation of x^(x>>1). See also snippet 0x31.

rax := (rax >> 0x01) ^ rax
rax := (rax >> 0x02) ^ rax
rax := (rax >> 0x04) ^ rax
rax := (rax >> 0x08) ^ rax
rax := (rax >> 0x10) ^ rax
rax := (rax >> 0x20) ^ rax

See also: https://en.wikipedia.org/wiki/Gray_code

Snippet 0x34

    mov      ecx,eax
    and      ecx,0xffff0000
    shr      ecx,0x10
    and      eax,0x0000ffff
    shl      eax,0x10
    or       eax,ecx

    mov      ecx,eax
    and      ecx,0xff00ff00
    shr      ecx,0x8
    and      eax,0x00ff00ff
    shl      eax,0x8
    or       eax,ecx

    mov      ecx,eax
    and      ecx,0xcccccccc
    shr      ecx,0x2
    and      eax,0x33333333
    shl      eax,0x2
    or       eax,ecx

    mov      ecx,eax
    and      ecx,0xf0f0f0f0
    shr      ecx,0x4
    and      eax,0x0f0f0f0f
    shl      eax,0x4
    or       eax,ecx

    mov      ecx,eax
    and      ecx,0xaaaaaaaa
    shr      ecx,0x1
    and      eax,0x55555555
    shl      eax,0x1
    or       eax,ecx

Bit Reversal Permutation - the cornerstone of FFT algorithms. From the Bithacks page: https://graphics.stanford.edu/~seander/bithacks.html#ReverseParallel.

Snippet 0x35

    mov      edx,eax
    and      eax,0x55555555
    shr      edx,0x1
    and      edx,0x55555555
    add      eax,edx

    mov      edx,eax
    and      eax,0x33333333
    shr      edx,0x2
    and      edx,0x33333333
    add      eax,edx

    mov      edx,eax
    and      eax,0x0f0f0f0f
    shr      edx,0x4
    and      edx,0x0f0f0f0f
    add      eax,edx

    mov      edx,eax
    and      eax,0x00ff00ff
    shr      edx,0x8
    and      edx,0x00ff00ff
    add      eax,edx

    mov      edx,eax
    and      eax,0x0000ffff
    shr      edx,0x10
    and      edx,0x0000ffff
    add      eax,edx

The basic Population Count algorithm (count the number of set bits in an word). There is a marginally better implementation on the Bithacks page that uses a cheeky subtraction step at the start and fewer ANDs: https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

Snippet 0x36

    dec      rax

    mov      rdx,rax
    shr      rdx,0x1
    or       rax,rdx

    mov      rdx,rax
    shr      rdx,0x2
    or       rax,rdx

    mov      rdx,rax
    shr      rdx,0x4
    or       rax,rdx

    mov      rdx,rax
    shr      rdx,0x8
    or       rax,rdx

    mov      rdx,rax
    shr      rdx,0x10
    or       rax,rdx

    mov      rdx,rax
    shr      rdx,0x20
    or       rax,rdx

    inc      rax

Maps positive values in rax to their next power of two, or itself if already a power of two. Non-positive values get mapped to 0.

This works by decreasing the number and replicating the most-significant non-zero bits to all their relatively least-significant bits, resulting in a value of the form 2^N - 1. Finally, the number is incremented to obtain the desired power of two.

Snippet 0x37

    mov      rdx,rax
    not      rdx
    mov      rcx,0x8080808080808080
    and      rdx,rcx
    mov      rcx,0x0101010101010101
    sub      rax,rcx
    and      rax,rdx

Another one from Bithacks: Determine if a word has a zero byte. Allows one to scan for the location of a zero (ASCIIZ terminator for example) by loading one machine word at a time.

Snippet 0x38

    bsf      rcx,rax

    mov      rdx,rax
    dec      rdx
    or       rdx,rax

    mov      rax,rdx
    inc      rax

    mov      rbx,rdx
    not      rbx
    inc      rdx
    and      rdx,rbx
    dec      rdx

    shr      rdx,cl
    shr      rdx,1

    or       rax,rdx

This little gem calculates the next biggest integer with the same weight (number of set bits). For example, it produces the following sequences (feeding it its previous output every step):

Try it in python:

# Snippet 0x38 - Calculate the successor of same weight
from __future__ import print_function

def bsf(x):
    n = 0
    while not (x & 1):
        x >>= 1
        n += 1
    return n

def popcount(x):
    n = 0
    while x:
        n += 1
        x &= (x - 1)    # Clear the bottom-most set bit - c.f. snippet 0x2f
    return n

def next_in_weight_class(x):
    l = bsf(x)
    d = x | (x - 1)
    x = d + 1
    d = (d + 1) & ~d
    x |= (d - 1) >> (1 + l)
    return x

def show_sequence(x, n=10):
    for i in range(n):
        print(popcount(x), bin(x))
        x = next_in_weight_class(x)

# Show first few weight classes

for w in range(1, 5):
    show_sequence((1 << w) - 1)   # First element of a weight class w repunit(w) = 2^w - 1

(thanks @eleemosynator)

Snippet 0x39

    mov      rdx,0xaaaaaaaaaaaaaaaa
    add      rax,rdx
    xor      rax,rdx

Computes the negabinary representation of rax.

Instead of having the binary basis (+1, +2, +4, +8, +16, +32, ...), negabinary numbers have the basis (+1, -2, +4, -8, +16, -32, ...). For instance, the number 3 (i.e. 0b11) gets mapped into 0b111 (i.e. 7) since 3 = 4 - 2 + 1.

Snippet 0x3A

    mov      rdx,rax
    neg      rdx
    and      rax,rdx

    mov      rdx,0x218a392cd3d5dbf
    mul      rdx
    shr      rax,0x3a

    xlatb

This snippet computes the amount of (left-most) leading zeros via De Bruijn sequences.

The first block determines the highest power of two dividing rax by computing rax := rax & (-rax). This results in the sequence A006519 whose first elements are: 0, 1, 2, 1, 4, 1, 2, 1, 8, ...

The second block together with the xlatb instruction computes rax := rbx[(rax * 218A392CD3D5DBF) >> 58] where rbx points to a De Bruijn table with 64 entries. This is equivalent to computing the binary logarithm of the previous 64-bit integer.

See also: http://supertech.csail.mit.edu/papers/debruijn.pdf

Snippet 0x3B

    cdq
    shl      eax,1
    and      edx,0xc0000401
    xor      eax,edx

This snippet computes:

if (eax & (1 << 31))
    eax = (eax << 1) ^ 0xC0000401
else
    eax <<= 1;

This snippet calculates the next state of a Galois LFSR with characteristic polynomial 0x1C000401 (x32 + x31 + x30 + x10 + 1). As this polynomial is primitive, the LFSR will cycle through all non-zero 32-bit integers. It also uses the cdq trick to expand the top bit into a mask which it then ands with the polynomial constant in order to avoid using a branch.

(thanks @eleemosynator)

Common in pseudo-random number generators and guarantees a period of 232−1. Several coefficients satisfy this property, though their number of terms in thir characteristic polynomial may vary. Similarly, coefficients for generators of period 264−1 exist as well. Libraries such as NoMSVCRT rely on this pattern, see cntr32 and cntr64.

(thanks @DamianFekete)

Snippet 0x3C

    mov      rbx,rax
    mov      rdx,rbx
    mov      rcx,0xaaaaaaaaaaaaaaaa
    and      rbx,rcx
    shr      rbx,1
    and      rbx,rdx
    popcnt   rbx,rbx
    and      rbx,1

    neg      rax
    mov      rdx,rax
    mov      rcx,0xaaaaaaaaaaaaaaaa
    and      rax,rcx
    shr      rax,1
    and      rax,rdx
    popcnt   rax,rax
    and      rax,1

    mov      rdx,rax
    add      rax,rbx
    dec      rax
    neg      rax
    sub      rdx,rbx

This snippet expects a number in rax and returns two outputs which are -1, 1 or 0 in rax and rdx. If we interpret the input as a sequence number and the outputs as steps in two dimensions (i.e. dx and dy) we can draw the result:

This truly mind-blowing snippet draws a Hilbert Curve in 21 assembly instructions without using recursion or branches. You can experiment with the Python implementation of the algorithm in xorpd_0x3c_hilbert.py (requires Pillow).

(thanks @eleemosynator)

Snippet 0x3D

    mov      rcx,1
.loop:
    xor      rax,rcx
    not      rax
    and      rcx,rax
    not      rax

    xor      rdx,rcx
    not      rdx
    and      rcx,rdx
    not      rdx

    shl      rcx,1
    jnz      .loop

The registers rax and rdx serve both as inputs and outputs of this snippet. Starting it at (0, 0) and treating the results as points in the 2D plane, we get the following pattern:

This is the Morton Curve or Z-order curve. It's based around the idea of visiting all points on the 2D plane in a hierarchical nearest-neighbour Z-pattern. To understand the generation algorithm consider the following method for incrementing an integer with N bits:

The loop in the algorithm above tracks the propagation of the carry with the loop exiting when it finds a 0 bit that can finally absorb the carry by turning into a 1. Expressed in the same language, the algorithm in the snippet does the following (naming the inputs x and y):

We have exactly the same carry-propagation structure except that the carry propagates from bit k of x to bit k of y before moving up to bit k+1 of x. This operation is exactly equivalent to incrementing an integer that is constructed by interleaving the bits of x and y, with the bits of x taking the even positions and the bits of y on the odd positions.

Both the snippet version and the interleaved increment version of the algorithm are implemented in the Python script xorpd_0x3d_morton.py (requires Pillow).

(thanks @eleemosynator)

Snippet 0x3E

    mov      rdx,rax
    shr      rdx,1
    xor      rax,rdx

    popcnt   rax,rax
    and      rax,0x3

Computes the direction of the lines in the (Heighway) Dragon Curve by computing popcnt(rax ^ (rax >> 1)) & 3. This produces the sequence A246960, the fixed point of the morphism {0 -> (0,1), 1 -> (2,1), 2 -> (2,3), 3 -> (0,3)}.

Snippet 0x3F

    mov      rbx,3
    mov      r8,rax
    mov      rcx,rax
    dec      rcx

    and      rax,rcx
    xor      edx,edx
    div      rbx
    mov      rsi,rdx

    mov      rax,r8
    or       rax,rcx
    xor      edx,edx
    div      rbx
    inc      rdx
    cmp      rdx,rbx
    sbb      rdi,rdi
    and      rdi,rdx

    bsf      rax,r8

This snippet takes a counter as input in rax and return two numbers between 0 and 2 in rsi and rdi and a number between 0 and 63 in rax. The use of rsi and rdi is probably a hint as they are the source index and destination index registers indicating that something is to be moved from one position to another. If we draw the outputs for successive values of rax we get:

This sequence is the solution to the Towers of Hanoi, and the snippet above implements its binary solution.

According to the legend, there is a hidden monastery in Hanoi which contains a large room with three time-worn posts and 64 golden disks. The monks, all sworn to secrecy, have been following rules of the Towers of Hanoi and moving the disks one at a time from peg to peg. It is said that when they have completed the sequence and all 64 golden disks are the middle peg then the universe will finally come to an end.

How appropriate then that the 64th and last snippet in the book solves a 64-disk Hanoi puzzle which once completed will signal the end of the universe. However there is one strange quirk - the third stanza that calculates the destination peg does something like (((m | m - 1) + 1) % 3 + 1) % 3 where the second reduction modulo 3 is done using the cmp/sbb/and pattern. Why not just increment m | m - 1 before the first reduction? It turns out that doing so would give the wrong destination for several of the end-game moves as m | m - 1 would become 264-1 and hence increment to zero instead of the intended 1.

The monks of Hanoi need not worry, if they follow the instructions of this snippet the universe will come to an orderly end as originally intended.

(thanks @eleemosynator)